The last decade has seen the advent and consolidation of ontology-based tools for the identification
and biological interpretation of classes of genes, such as the Gene Ontology (GO). The Gene Ontology
is constantly evolving. The information accumulated over time and included in the GO is encoded in
the definition of terms and in the semantic relations set up amongst terms.
This approach might be usefully complemented by a bottom-up approach based on the knowledge of
relationships amongst genes. To this end, we investigate the Gene Ontology from a complex network
perspective. We consider the semantic network of terms naturally associated with the semantic
relationships provided by the Gene Ontology consortium and a gene-based weighted network in
which the nodes are the terms and a link between any two terms is weighted according to the number
of genes that are annotated in both terms. One aim of the present paper is to understand whether
the semantic and the gene-based networks share the same structural properties. We
show that the main differences between the two networks are in the number of links and not in
the relative importance of the terms within the network. We then consider network communities.
We show that in some cases GO terms that appear to be distinct from a semantic point of view are
instead connected when considering their gene content. The identification of communities in the
statistically validated network (SVN) can therefore be the basis of a simple protocol aiming at fully
exploiting the possible relationships amongst terms, thus improving the semantic structure of GO.
This is also important from a biomedical point of view, as it might reveal how genes over-expressed
in a certain term also affect other biological processes, molecular functions and cellular components
not directly linked by the GO semantics. As a by-product, we present a simple methodology that gives
a first-glance insight into the biological characterization of groups of GO terms.

PACS numbers: 



I. INTRODUCTION 

The last decade has seen the advent and consolidation of ontology-based tools for the identification and biological
interpretation of classes of genes, such as the Gene Ontology (GO) [1]. Typically, ontologies associate
a gene with its biological functions and also provide information about the other genes that cooperate in
performing such functions. As such, ontologies are a useful tool for identifying sets of genes involved in
a certain pathology. These instruments are an alternative to the usual clustering methodologies, which are used,
for example, in the analysis of gene expression profiles obtained from microarray experiments. One main difference
is that the classes of genes obtained in an ontological analysis have a clear biological interpretation.

The GO is constantly evolving [2-4]. The information accumulated over time and included in the
GO is encoded in the definition of terms and in the relations set up amongst terms. Thus a key point for the
evolution and maintenance of GO is to set up protocols that are able to capture the relations amongst genes so that
they can be profitably transferred to the level of terms. In fact, the GO is a controlled vocabulary in which each
term is explained in some detail and a set of relations between the terms is also given. This semantic structure is
mainly based on biological evidence about the functions described by the terms, i.e. on the knowledge of existing
relations amongst biological functions.

This is a top-down approach that we think might be usefully complemented by a bottom-up approach based on the
direct knowledge of relationships amongst genes. Given a set of terms, one can set up a link between any two terms if
they have a gene in common. Therefore, in parallel with the semantic network of terms naturally associated with the
semantic relationships provided by GO, it is possible to generate a gene-based weighted network in which the nodes
are the terms and a link between any two terms is weighted according to the number of genes that are annotated in
both terms [5]. Moreover, it is possible to consider the bipartite system of GO terms and genes and investigate its
properties by constructing and analyzing the projected network on one of the two sets. Here we will be interested in
the projected network of terms. Recently a methodology has been proposed that identifies preferential links in the
projected network [6], i.e. links whose presence in the projected network cannot be explained in terms of a random
co-occurrence of neighbors in the bipartite system. The resulting network, where the existence of each link has been
statistically tested against a null hypothesis of randomness, is called a statistically validated network (SVN). One aim
of the present paper is to understand whether the semantic and the gene-based networks share the same structural
properties. We will show that the scatter-plots of the betweenness versus the degree in both networks
are compatible with a quadratic dependence of the betweenness on the degree, thus indicating that the
main differences between the two networks are in the number of links and not in the relative importance of the terms
within the network.

Another way to compare the information encoded in the semantic GO structure with the information associated with the
genes annotated in the terms is to investigate the network communities. Once the gene-based network has
been constructed, it can be partitioned in order to look for communities. In this case, communities are sets of GO
terms that share a similar profile in terms of their annotated genes. The same can be done for the semantic network,
thus obtaining sets of GO terms that share a similar profile in terms of their semantic relationships. The
idea of searching for communities or clusters of GO terms is not new. However, one usually looks for communities
or clusters within the semantic GO network only [7-14]. Our approach is different: we use the information
on the gene content of each GO term to create a statistically validated network of GO terms and then we
partition it by using any standard community detection algorithm. As we will show below, it turns out that in some
cases GO terms present in the same community of the gene-based network are not joined by any semantic link. This
shows that terms that appear to be distinct from a semantic point of view are instead connected when considering
their gene content. The identification of communities in the gene-based network can therefore be the basis of a simple
protocol able to fully exploit the possible relationships amongst terms, thus improving the semantic structure of GO.

As a by-product, we present a simple methodology that gives a first-glance insight into the biological
meaning of groups of GO terms. We have put on a statistical basis what researchers typically do first when they obtain
a list of GO terms that are somehow relevant to their analysis: read the definitions of the GO terms and try to
figure out a possible "story" explaining why the obtained terms are clustered
together. We have devised a procedure that helps in this direction by providing the most relevant "words" of the
"story".

The paper is organized as follows: in section II A we illustrate the data considered in our investigation, while in
section II B we briefly review the methodologies that allow the generation of statistically validated networks and
the cluster characterization. In section III we study the semantic and gene-based networks and show GO terms
that belong to the same gene-based network community and are not joined by any semantic link. Our conclusions
are drawn in section IV.



II. DATA AND METHODOLOGY 

A. The Gene Ontology database 

We consider only the human part of the Gene Ontology. To this end we downloaded the
gene_association.goa_human.gz [15] file, release 1.224 with GOC validation date 20/02/2012. This is the file
that accounts for the association between terms and genes. Together with that we also downloaded the
gene_ontology_edit.obo [16] file, release 1.1.2667 with release date 02/03/2012 09:20. This is the file that contains
a description of the terms and the semantic links amongst them. Based on the semantic links, we associated with each term
all genes directly annotated in its children terms. As a result our system involves 12564 terms and 18992 genes.
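
The propagation of annotations from children to parent terms can be illustrated with a minimal Python sketch. It assumes that the ontology is already available as a child-to-parents mapping and that the direct annotations are available as a term-to-genes mapping; parsing of the two downloaded files is not shown. The function name `propagate_annotations`, the identifiers `T1`...`T4` and `g1`...`g3`, and the choice of propagating to all ancestors (not only to direct parents) are assumptions made for illustration.

```python
from collections import defaultdict


def propagate_annotations(parents, direct):
    """Return term -> set of genes annotated to the term or to any descendant.

    parents: dict mapping each term to an iterable of its parent terms.
    direct:  dict mapping each term to the set of genes annotated directly to it.
    """
    full = defaultdict(set)
    for term, genes in direct.items():
        # Walk upwards from the term, visiting every ancestor exactly once.
        stack, seen = [term], {term}
        while stack:
            current = stack.pop()
            full[current] |= genes
            for parent in parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
    return dict(full)


# Toy ontology (hypothetical identifiers): T3 -> T2 -> T1 and T4 -> T1.
parents = {"T3": ["T2"], "T2": ["T1"], "T4": ["T1"]}
direct = {"T3": {"g1"}, "T2": {"g2"}, "T4": {"g3"}}
annotations = propagate_annotations(parents, direct)
```

In this toy case the root `T1` has no direct annotation and still collects the genes of all its descendants.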



B. Statistically validated networks 



Many complex systems present an intrinsic bipartite nature and are often described and modeled in terms of
networks. Bipartite networks are composed of two disjoint sets of nodes, say set A and set B, such that every link
connects a node in the first set with a node in the second set. A bipartite network is often transformed by one-mode
projection, i.e. one creates a network of nodes belonging to one of the two sets, in which two nodes are connected when
they have at least one common neighboring node in the other set.
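
As an illustration of the one-mode projection, the following Python sketch builds the weighted projection on the set of terms, where the weight of a link is the number of shared genes. The input layout (a dictionary from term to gene set), the function name `project_terms` and the toy identifiers are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def project_terms(term_genes):
    """One-mode projection onto terms: link weight = number of shared genes."""
    # Invert the bipartite map: gene -> set of terms that contain it.
    gene_terms = {}
    for term, genes in term_genes.items():
        for gene in genes:
            gene_terms.setdefault(gene, set()).add(term)
    weights = Counter()
    for terms in gene_terms.values():
        # Every pair of terms sharing this gene gains one unit of weight.
        for a, b in combinations(sorted(terms), 2):
            weights[(a, b)] += 1
    return weights


# Toy bipartite system (hypothetical identifiers).
term_genes = {"T1": {"g1", "g2", "g3"}, "T2": {"g1", "g2"}, "T4": {"g3"}}
links = project_terms(term_genes)
```

Iterating over genes rather than over all pairs of terms avoids testing pairs that share no gene, which matters when most pairs of terms are unconnected.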

Bipartite networks are often very heterogeneous in the number of relationships that the elements of one set establish
with the elements of the other set. When one constructs a projected network with nodes from only one set, the
system heterogeneity makes it very difficult to identify preferential links between the elements. A methodology to
statistically validate each link of the projected network against a null hypothesis that takes into account the
heterogeneity of the system has been recently introduced [6].

Let us consider two nodes $A_1$ and $A_2$, both belonging to set A. Let $N_1$ be the number of elements in set B linked to
node $A_1$ and $N_2$ the number of elements in set B linked to node $A_2$. The total number of elements in set B is $N_B$, and
the observed number of elements in set B linked to both $A_1$ and $A_2$ is $N_{12}$. Under the null hypothesis of random
co-occurrence, the probability of observing $X$ co-occurrences of links to both $A_1$ and $A_2$ is given by the hypergeometric
distribution [17]

$$H(X \mid N_B, N_1, N_2) = \frac{\binom{N_1}{X}\binom{N_B - N_1}{N_2 - X}}{\binom{N_B}{N_2}}. \qquad (1)$$

We can therefore associate a p-value to the observed $N_{12}$ as:

$$p(N_{12}) = 1 - \sum_{X=0}^{N_{12}-1} H(X \mid N_B, N_1, N_2). \qquad (2)$$


It is therefore possible to associate a p-value to each link in the projected network. After fixing a threshold one is 
thus able to select those links whose p-value is below the threshold. These links constitute the statistically validated 
network. Our working hypothesis is that such links are the most informative links in the original projected network 
because they are not compatible with a null hypothesis of randomness. 
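
The p-value associated with a link can be computed directly from the hypergeometric distribution. The following Python sketch uses only the standard library; the function names and the toy numbers are assumptions made for illustration and do not reproduce the software mentioned in this section.

```python
from math import comb


def hypergeom_pmf(x, n_b, n_1, n_2):
    """H(X | N_B, N_1, N_2): probability of exactly x co-occurrences."""
    return comb(n_1, x) * comb(n_b - n_1, n_2 - x) / comb(n_b, n_2)


def link_p_value(n_12, n_b, n_1, n_2):
    """p-value of Eq. (2): probability of observing at least n_12 co-occurrences."""
    return 1.0 - sum(hypergeom_pmf(x, n_b, n_1, n_2) for x in range(n_12))


# Toy numbers: 10 elements in set B, each node linked to 3 of them, all 3 shared.
p = link_p_value(n_12=3, n_b=10, n_1=3, n_2=3)
```

For very small p-values the subtraction from 1 loses floating-point precision; summing the upper tail of the distribution directly is mathematically equivalent and numerically safer.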

The selection of the appropriate threshold is a key point. Since the null hypothesis is tested for all links
of the original projected network, multiple test correction procedures must be
applied. There are two possible correction procedures: the Bonferroni correction [18] and the less restrictive
FDR correction [19]. The Bonferroni correction is based on the consideration that if one tests $N_t$ hypotheses
on a set of data, whether dependent or independent, then a conservative way of keeping the error rate low is to test each
individual hypothesis at a statistical significance level of $p_t/N_t$, where $p_t$ is the chosen statistical threshold (1% in
the present study). The threshold of the FDR correction increases linearly with the number of tests in which the null
hypothesis is rejected.
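
The two corrections can be compared on a small list of p-values. The sketch below is a minimal Python illustration; it assumes that the FDR correction is applied through the Benjamini-Hochberg step-up rule, which is an assumption about the variant used, and the function names and p-values are hypothetical.

```python
def bonferroni_validated(p_values, p_t=0.01):
    """Indices of the tests whose p-value is below p_t / N_t."""
    cutoff = p_t / len(p_values)
    return [i for i, p in enumerate(p_values) if p < cutoff]


def fdr_validated(p_values, p_t=0.01):
    """Indices validated by the Benjamini-Hochberg step-up rule (assumed variant)."""
    n_t = len(p_values)
    order = sorted(range(n_t), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # The rank-th smallest p-value is compared with rank * p_t / N_t.
        if p_values[i] <= rank * p_t / n_t:
            k_max = rank
    # All tests up to the largest rank satisfying the condition are validated.
    return sorted(order[:k_max])


# Hypothetical p-values for five links.
p_values = [0.0001, 0.003, 0.005, 0.2, 0.9]
bonferroni = bonferroni_validated(p_values)
fdr = fdr_validated(p_values)
```

With these values the Bonferroni cutoff is 0.002, while the FDR cutoff grows with the rank of the p-value, so the FDR selection contains the Bonferroni one.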

When the heterogeneity in one of the two sets is large, the above approach can be modified as follows. Suppose one
wants to validate the links in the projected network of set A. If the heterogeneity of the elements of set B is high, one
can construct bipartite subsystems $S_k$ of the original bipartite system S, where each $S_k$ consists of all the $N_B^k$
elements of set B with a given degree k and of all the elements of set A linked to them. By construction, a subsystem $S_k$
is homogeneous with respect to the degree of the elements belonging to set B. The methodology sketched above can therefore
be applied to each subsystem $S_k$, thus obtaining a collection of statistically validated networks. In this case the p-value
threshold must take into account that we are testing the null hypothesis for each subsystem and for each link in it. We then
aggregate all validations by generating a statistically validated network where a link between a pair of nodes is established
whenever it has been validated in at least one subsystem. Such a link can be given a weight equal to the total number
of subsystems in which the link has been statistically validated.
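
The degree-based validation can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is a dictionary from each element of set B (here a gene) to the set of elements of set A (here terms) linked to it, the corrected p-value cutoff is supplied by the caller, and the names `validate_by_degree` and `upper_tail_p` are hypothetical. The p-value is computed as the upper tail of the hypergeometric distribution, which equals one minus the sum of the lower terms.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import comb


def upper_tail_p(n_12, n_b, n_1, n_2):
    """Probability of at least n_12 co-occurrences under the hypergeometric null."""
    tail = sum(comb(n_1, x) * comb(n_b - n_1, n_2 - x)
               for x in range(n_12, min(n_1, n_2) + 1))
    return tail / comb(n_b, n_2)


def validate_by_degree(gene_terms, cutoff):
    """Return {(term_a, term_b): number of subsystems S_k validating the link}."""
    # Split the elements of set B (genes) according to their degree k.
    by_degree = defaultdict(list)
    for terms in gene_terms.values():
        by_degree[len(terms)].append(terms)
    weights = Counter()
    for members in by_degree.values():
        n_b = len(members)           # number of set-B elements in subsystem S_k
        occurrences = Counter()      # per term: genes of S_k linked to it
        co_occurrences = Counter()   # per term pair: genes of S_k linked to both
        for terms in members:
            occurrences.update(terms)
            co_occurrences.update(combinations(sorted(terms), 2))
        for (a, b), n_12 in co_occurrences.items():
            if upper_tail_p(n_12, n_b, occurrences[a], occurrences[b]) < cutoff:
                weights[(a, b)] += 1
    return dict(weights)


# Toy system (hypothetical identifiers): 21 genes of degree 2, 20 genes of degree 3.
gene_terms = {}
for i in range(10):
    gene_terms[f"a{i}"] = {"T1", "T2"}
    gene_terms[f"b{i}"] = {"T3", "T4"}
    gene_terms[f"c{i}"] = {"T1", "T2", "T5"}
    gene_terms[f"d{i}"] = {"T3", "T4", "T6"}
gene_terms["e0"] = {"T1", "T3"}
validated = validate_by_degree(gene_terms, cutoff=0.01 / 9)
```

In the toy system the pair (T1, T2) co-occurs systematically in both subsystems and receives weight 2, whereas the single co-occurrence of T1 and T3 is compatible with the null hypothesis and is not validated.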

We refer to the network obtained by using the FDR correction for multiple comparisons as the FDR network, and to
the network obtained by using the Bonferroni correction for multiple comparisons as the Bonferroni network.
By construction, the Bonferroni network is a subgraph of the FDR network, which is a subgraph of the original
projected adjacency network. Software to compute the Bonferroni and FDR networks is available at the following
web-sites: http://ocs.unipa.it/validate.html and http://ocs.unipa.it/validate-k.html.



Given a network, a first step in understanding the represented system is the identification of communities
within the network. Communities are sets of nodes that are linked amongst themselves to a degree higher than
expected on the basis of a null hypothesis of randomness [20]. This is analogous to breaking down atomic particles
into smaller pieces in order to understand their elementary constituents. This step, however, requires that these
elementary constituents be given a name and that their features be clearly stated. In other words, once communities are
detected, it is important to characterize them, i.e. to understand the main features that explain why
nodes are grouped together in a community.

A statistically robust methodology for community characterization has been given in Ref. [21]. The main idea
is to use attributes specific to the nodes in the cluster in order to see which attributes are most represented in
the community. Suppose we have a community of $K$ nodes, $X$ of which are characterized by a
certain attribute A. Suppose that in a network of $N$ elements the attribute A is associated with $M$ out of the $N$ nodes.
Then the probability that $X$ is observed by chance is given by the hypergeometric distribution:

$$H(X \mid N, K, M) = \frac{\binom{K}{X}\binom{N - K}{M - X}}{\binom{N}{M}}. \qquad (3)$$

FIG. 1: Partition of the semantic adjacency network by using the Infomap algorithm [22]. The left panel shows the different
communities observed. The right panel shows a rank plot of the size of communities.

For each attribute present in a community and for each community in the network we can therefore compute a p-value.
By applying the multiple hypothesis testing corrections illustrated above we can thus identify the
attributes that are over-expressed in a community.
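
The characterization procedure can be sketched in Python as follows. The sketch assumes that the over-expression p-value is the upper tail of the hypergeometric distribution and that a Bonferroni correction over all (community, attribute) tests is used; the function names, the data layout and the toy attributes are hypothetical.

```python
from collections import Counter
from math import comb


def over_expression_p(x, n, k, m):
    """Probability that at least x of the k community nodes carry an attribute
    held by m of the n nodes of the network."""
    tail = sum(comb(k, j) * comb(n - k, m - j) for j in range(x, min(k, m) + 1))
    return tail / comb(n, m)


def characterize(communities, attributes, p_t=0.01):
    """Return {community: [over-expressed attributes]} with a Bonferroni cutoff."""
    n = len(attributes)
    # M: number of nodes of the whole network carrying each attribute.
    totals = Counter(a for attrs in attributes.values() for a in attrs)
    tests = []
    for name, nodes in communities.items():
        # X: number of nodes of this community carrying each attribute.
        counts = Counter(a for node in nodes for a in attributes[node])
        for attr, x in counts.items():
            tests.append((name, attr, over_expression_p(x, n, len(nodes), totals[attr])))
    cutoff = p_t / len(tests)
    result = {name: [] for name in communities}
    for name, attr, p in tests:
        if p < cutoff:
            result[name].append(attr)
    return result


# Toy network of 20 nodes in two communities (hypothetical attributes).
attributes = {i: {"common", "x" if i < 10 else "y"} for i in range(20)}
communities = {"C1": list(range(10)), "C2": list(range(10, 20))}
over_expressed = characterize(communities, attributes)
```

The attribute shared by every node is never over-expressed, since observing it in all nodes of a community is exactly what the null hypothesis predicts.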

Software to perform the statistical characterization of communities within a network is available at the following
web-site: http://ocs.unipa.it/characterize.html.



III. RESULTS AND DISCUSSION

A. The semantic adjacency networks

The semantic structure of the Gene Ontology can be described in terms of an adjacency network where the nodes
are the GO terms and the links between terms are provided by the semantic links of the gene_ontology_edit.obo file.
When restricting to the human case, we get a network with $N = 12564$ nodes and $L_s = 116422$ links. The semantic
network is naturally partitioned in three large communities of size $N_1 = 8118$, $N_2 = 3336$ and $N_3 = 1110$. They
correspond to the three main branches of GO: GO:0008150 (Biological Processes), GO:0003674 (Molecular Function)
and GO:0005575 (Cellular Component). The numbers of links connecting terms inside the three branches are 85447,
18489 and 12486, respectively.

The network can be further partitioned by using a community detection algorithm [20] such as Infomap [22]. We
performed 100 runs of the algorithm. In each run the algorithm performed 10 searches. We selected the partition with
the lowest code length. We thus obtained a partition of the adjacency semantic network in $N_p = 163$ communities
of size ranging from 1096 to 2, as shown in Fig. 1 (right panel).

The partitioned network is shown in Fig. 1 (left panel). The list with the association between the GO terms and
their community is given in Table SM1 of the Supplementary Material. For example, the 50th largest community,
of size 54, contains terms like GO:0001504 (neurotransmitter uptake) or GO:0007268 (synaptic transmission), which
are homogeneous from a biological point of view. This suggests that the obtained communities are
biologically meaningful. This is confirmed by performing a characterization of the clusters according
to the methodology illustrated in Ref. [21]. For each term in the whole adjacency semantic network we first consider the
words that define the term. After eliminating articles and prepositions we have 12855 distinct words. We then
construct the 3-words obtained by concatenating three consecutive words in the term's definition. For
example, in the case of the GO term GO:0048518 (positive regulation of biological process) we get the two 3-words
positive_regulation_biological and regulation_biological_process. We have 34309 distinct 3-words in the whole network.
We then delete the 3-words that appear at most twice within the whole network, which leaves 10254 distinct
3-words. We use these 3-words as attributes of the GO terms and statistically validate their over-expression
in each of the $N_p = 163$ communities obtained by using the Infomap algorithm. All results are given in Table
SM2 of the Supplementary Material. For the 50th largest community mentioned above we get the following
over-represented 3-words: regulation_synapse_assembly and positive_regulation_synapse, thus confirming that the 50th
largest community groups together GO terms dealing with the regulation of the biological processes active at the level
of synaptic transmission. As further examples, in Table I we show the results for the two largest clusters. The
first cluster seems to aggregate terms involved in the metabolic processes that break down acids into smaller units
and the second cluster groups terms involved in the transport and protein targeting processes.

Cluster   Size   Over-expressed 3-words attribute
1         1096   (SSU-rRNA_5.8S_rRNA
1         1096   5.8S_rRNA_LSU-rRNA)
1         1096   sulfate_proteoglycan_biosynthetic
1         1096   proteoglycan_biosynthetic_process
1         1096   acid_catabolic_process
1         1096   mRNA_catabolic_process
1         1096   acid_metabolic_process
1         1096   compound_metabolic_process
1         1096   acid_biosynthetic_process
2         1043   transport_vesicle_membrane
2         1043   endoplasmic_reticulum_membrane
2         1043   side_plasma_membrane
2         1043   ubiquitin_ligase_complex

TABLE I: Statistical characterization of the communities of the semantic adjacency network. Communities have been obtained
by using the Infomap algorithm [22]. The characterization has been done by using the methodology of Ref. [21]. The attributes
of each term are the 3-words obtained by concatenating three consecutive words in the term's definition. Only the
results for the first two clusters are shown. The full list of characterizations is given in Table SM2 of the Supplementary
Material.

We have therefore devised a simple methodology that gives a first-glance insight into groups of GO
terms. We have put on a statistical basis what researchers typically do first when they obtain a list of GO terms that are
somehow relevant to their analysis: read the definitions of the GO terms and
try to figure out a possible "story" explaining why the obtained terms are grouped together. The procedure described
above helps in this direction by providing the most relevant "words" of the "story".
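
The construction of the 3-words can be illustrated with a short Python sketch. The list of articles and prepositions in `STOP_WORDS`, the function names and the sample term names are assumptions made for illustration; the exact word list used in the analysis is not specified in the text.

```python
from collections import Counter

# Assumed list of articles and prepositions to discard.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "by", "from", "with", "for", "at", "on"}


def three_words(name):
    """Return the 3-words of a term name: runs of three consecutive kept words."""
    words = [w for w in name.split() if w.lower() not in STOP_WORDS]
    return ["_".join(words[i:i + 3]) for i in range(len(words) - 2)]


def frequent_three_words(names, min_count=3):
    """Keep only the 3-words appearing at least min_count times over all names."""
    counts = Counter(tw for name in names for tw in three_words(name))
    return {tw for tw, c in counts.items() if c >= min_count}


example = three_words("positive regulation of biological process")

# Illustrative term names.
names = [
    "regulation of synapse assembly",
    "positive regulation of synapse assembly",
    "negative regulation of synapse assembly",
    "synaptic transmission",
]
kept = frequent_three_words(names)
```

For the term name used as an example in the text, the sketch returns the same two 3-words, and names with fewer than three kept words contribute no attribute.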



B. The gene-based adjacency networks 

Let us now consider the GO structure that emerges when considering the gene content of the terms. For each term
of the Gene Ontology we consider all genes annotated in it. We assign a gene to a certain term whenever the gene
is directly annotated either in the term or in any of its children. We thus have a bipartite system of genes and GO terms.
In Fig. 2 (left panel) we show the number of genes assigned to each term and in Fig. 2 (right panel) we show the
number of terms each gene belongs to. The heterogeneity is quite large in both respects. There are terms containing
over 10000 genes as well as terms containing only one gene. Analogously, there are genes present in over 500
GO terms as well as genes present in just two terms.



As mentioned in section II B, starting from the bipartite system of GO terms and genes it is possible to construct
two projected networks: that of GO terms and that of genes. The projected network of genes would be a network
where nodes are genes and any two genes are connected if they belong to the same GO term. This would provide
information about the links between genes based on their membership in biological processes, molecular functions
and cellular components. For our purposes we here consider the projected adjacency network of GO terms, i.e.
the gene-based network where any two terms are connected by a link whenever there exists at least
one gene assigned to both terms. This adjacency network involves $N = 12564$ nodes and $L_g = 5142743$ links. It is
worth noticing that the number of links in this network is much larger than the number $L_s$ of semantic links.

FIG. 2: For the GO terms/genes bipartite system, the left panel shows the number of genes assigned to each term and
the right panel shows the number of terms each gene belongs to.

FIG. 3: The left panel shows a comparison between the degree distributions of the semantic (circles) and gene-based (triangles)
networks. The right panel shows the scatter-plot describing the relationship between degree and betweenness for each term in
the adjacency semantic (circles) and gene-based (triangles) networks. The solid line represents a reference curve $y \propto x^2$,
showing that both scatter-plots are compatible with a quadratic dependence of the betweenness on the degree.

In Fig. 3 (left panel) we show the degree distribution [25] for the semantic (circles) and gene-based (triangles)
networks. The profiles of the two distributions are quite different, mainly because terms in the gene-based
network are much more linked to each other. This might indicate that genes perform different
tasks within different biological processes, molecular functions and cellular components. It should also be
noted, however, that this might be an artifact of the fact that, when generating the gene-based network, a gene annotated
in term T is also assigned to all terms that are parents of T. In Fig. 3 (right panel) we show the scatter-plot describing the
relationship between degree and betweenness [25] for each term in the semantic (circles) and gene-based (triangles)
networks. Contrary to the degree distributions, the two scatter-plots are quite similar. The solid
line represents a reference curve $y \propto x^2$, showing that both scatter-plots are compatible with a quadratic dependence
of the betweenness on the degree. These results suggest that the main difference between the two networks is in
the number of links and not in the relative importance of the terms within the network. In this respect, it is worth
mentioning that the average path length [25] of the semantic adjacency network is 1.997, while the average path length
of the adjacency gene-based network is 1.935, which is very close.
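
The average path length can be computed with a breadth-first search from every node. The sketch below is a minimal Python illustration that assumes an unweighted, undirected network stored as a dictionary of neighbour sets and that averages only over pairs of nodes joined by a path; the function name and the star-shaped toy graph are hypothetical.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest-path length over all ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in adjacency:
        # Breadth-first search gives shortest distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Star graph: one hub linked to four leaves.
star = {"hub": {"a", "b", "c", "d"}, "a": {"hub"}, "b": {"hub"},
        "c": {"hub"}, "d": {"hub"}}
apl = average_path_length(star)
```

In a star graph every pair of leaves is at distance 2 and every leaf is at distance 1 from the hub, so the average path length stays below 2 and approaches 2 as the number of leaves grows.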

The gene-based network is fully connected [25]. Therefore even terms belonging to different branches of the GO
(namely GO:0008150 (Biological Processes), GO:0005575 (Cellular Component) and GO:0003674 (Molecular
Function)) can be linked together when looking at their gene content. The number of links between the 8118 terms of the
BP branch is 3054021, the number of links between the 3336 terms of the MF branch is 106532 and the number of
links between the 1110 terms of the CC branch is 62339, as summarized in Table II. The number of links connecting
terms of different branches is 1919851, i.e. 37% of the total number of links. The majority of links therefore
connects terms of the BP branch, although a relevant number of links also connects the BP branch with the others.
The number of links in the gene-based network is larger than in the semantic network, and this is maintained when
one restricts the analysis to the three main branches, see Table II. This probably explains why the average path length
is slightly smaller in the gene-based network. It is perhaps surprising that the difference in the
average path lengths is so small, despite the large difference in the number of links between the
two networks. This might be an indication of redundancy: the relationships between biological processes mediated
by genes might go through many different channels in order to ensure that a link between the terms is always active,
despite possible impairments of some of the channels.

          Semantic network              Gene-based network
Branch    Nodes    Links     APL        Nodes    Links      APL
All       12564    116422    1.997      12564    5142743    1.935
BP        8118     85447     1.997      8118     3054021    1.907
MF        3336     18489     1.997      3336     106532     1.981
CC        1110     12486     1.980      1110     62339      1.899

TABLE II: Basic metrics for the adjacency gene-based and semantic networks. In the table we also show the results for the
three main GO branches, namely GO:0008150 (Biological Processes), GO:0005575 (Cellular Component) and GO:0003674
(Molecular Function). These branches are disconnected in the semantic network, while they are linked in the gene-based
network. APL stands for Average Path Length.
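
The number of links between different branches follows from the link counts reported above by simple subtraction. The short Python check below uses only numbers quoted in the text and in Table II; the variable names are arbitrary.

```python
# Link counts of the gene-based adjacency network, as reported in the text.
total_links = 5142743
within = {"BP": 3054021, "MF": 106532, "CC": 62339}

# Links joining terms of different branches, and their share of all links.
cross = total_links - sum(within.values())
share = cross / total_links

# In the semantic network the branches are disconnected, so the within-branch
# link counts must add up to the total number of semantic links.
semantic_within = {"BP": 85447, "MF": 18489, "CC": 12486}
semantic_total = sum(semantic_within.values())

print(cross, f"{share:.1%}", semantic_total)
```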

C. The Bonferroni gene-based network 

As we mentioned above, the adjacency gene-based network is fully connected and shows a high level of redundancy.
One might therefore ask which are the main links between terms. The answer can be given by considering statistically
validated networks. Such networks only involve links that are statistically validated against a null hypothesis of
randomness that takes into account the natural heterogeneity of the system. They can therefore be considered as a
tool for filtering relevant information out of the system, based on the gene content of the terms.

Let us consider here the Bonferroni network of GO terms. Given the large heterogeneity in the set of genes, we have
constructed the validated network by adopting the validation procedure that involves the construction of subsystems
where all elements have the same degree. The Bonferroni network of GO terms thus obtained involves 558 nodes and
3508 links. It thus involves 4.75% of the terms and 0.08% of the links of the adjacency network. These
numbers show how large the reduction of both nodes and links can be when the statistical significance of a link is
assessed by using a null hypothesis that properly takes into account the heterogeneity of the system [24]. A suggestive
explanation of such a large reduction is again redundancy: the fact that the adjacency gene-based network contains so
many links compatible with a null hypothesis of randomness might be due to the need to create as many channels of
communication amongst terms as possible, so that impairments have negligible impact on the system. In this respect,
the Bonferroni network provides the core of the system.

In Fig. 4 (left panel) we show the degree [25] of the nodes present in the Bonferroni network. The most connected
nodes have degree values over 50. This means that despite the large reduction of nodes with respect to the adjacency
network, there are still nodes that behave like hubs. In Fig. 4 (right panel) we show the scatter-plot describing the
relationship between degree and betweenness [25] for each term in the Bonferroni network. When comparing this
figure with Fig. 3 (right panel), the severe cut in the number of nodes and links becomes evident. This is confirmed
by the average path length [25] of the Bonferroni network, which amounts to 4.017.

The Bonferroni network of GO terms is naturally partitioned in 30 clusters. The largest cluster has size 467 and the
second largest has size 8, thus indicating the presence of a giant component. We also partitioned the network by using
the Infomap algorithm [22]. As a result, we get a partition involving $N_p = 74$ clusters whose size ranges from 2 to
40, as shown in Fig. 5.
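
The natural partition of a network into disconnected clusters can be obtained by a simple graph traversal. The following Python sketch assumes that these clusters are the connected components of the network, stored as a dictionary of neighbour sets; the function name and the toy graph are hypothetical.

```python
def connected_components(adjacency):
    """Return the connected components of an undirected graph, largest first."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting every node reachable from start.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy graph: a chain of four nodes and a separate pair.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
         "e": {"f"}, "f": {"e"}}
components = connected_components(graph)
sizes = [len(c) for c in components]
```

A rank plot of `sizes` makes a giant component immediately visible as a first value much larger than all the others.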

D. Community characterization of the Bonferroni gene-based network 

We can therefore compare, at the level of network communities, the information encoded in the semantic GO structure
with the information associated with the genes annotated in the terms. In fact, each of the above clusters groups
together GO terms on the basis of their gene content. However, any GO term T in a gene-based cluster C carries
the semantic information inherited from the general GO semantic structure. We want to bring together these two
levels of information by using the semantic information of any GO term T to characterize the gene-based cluster
C that includes T. The idea is to use the methodology of Ref. [21] with the semantic information of T as the
attribute for the characterization of the gene-based cluster C. Specifically, we use the semantic information obtained
from the Infomap partitioning of the adjacency semantic network. We recall that in section III A we obtained a
partitioning of the adjacency semantic network in communities that are homogeneous from a biological point of view.
Thus, we here consider as attributes the clusters $A_i$, $i = 1, \dots, 163$, of Fig. 1. Each cluster C of GO terms in the
gene-based Bonferroni network can therefore be characterized in terms of 163 attributes. In Table III we report the
over-expressions for the largest clusters, with size greater than 20. The full list of characterizations is given in
Table SM3 of the Supplementary Material.

FIG. 4: The left panel shows the degree rank plot for the Bonferroni gene-based network. The right panel shows the scatter-plot
describing the relationship between degree and betweenness for each term in the Bonferroni gene-based network.

FIG. 5: Partition of the gene-based Bonferroni network by using the Infomap algorithm [22]. The left panel shows the different
communities observed. The right panel shows a rank plot of the size of communities.

One first thing to notice is that clusters can be characterized in terms of more than one attribute. Moreover,
Table III shows that the same cluster of the gene-based Bonferroni network can contain terms that belong to
two different GO branches. These examples show that there exist GO terms that have no semantic link with each
other and can nevertheless be connected when the gene content of their children terms is considered. In other
words, if we think of the attributes as defining homogeneous and distinct subsets of the GO, such subsets might be
disconnected from a semantic point of view and connected when the gene content of their terms is taken into account.
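
Pairs of terms that are grouped together by their gene content without being semantically linked can be listed with a few lines of Python. The sketch below only checks for direct semantic links and assumes that a gene-based cluster is given as a set of terms and the semantic network as a list of term pairs; the function name and the identifiers are hypothetical.

```python
from itertools import combinations


def semantically_unlinked_pairs(cluster, semantic_links):
    """Pairs of terms of one gene-based cluster that share no direct semantic link."""
    # Semantic links are treated as undirected, so each pair is stored unordered.
    linked = {frozenset(link) for link in semantic_links}
    return [pair for pair in combinations(sorted(cluster), 2)
            if frozenset(pair) not in linked]


# Toy example (hypothetical identifiers): only T1 and T2 are semantically linked.
cluster = {"T1", "T2", "T3"}
semantic_links = [("T2", "T1")]
candidates = semantically_unlinked_pairs(cluster, semantic_links)
```

Each returned pair is a candidate relationship between terms suggested by gene content alone, to be examined against the semantic structure.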



IV. CONCLUSION 



In this paper we have considered the Gene Ontology from a complex network perspective. We have considered two
types of GO networks: the one associated with the semantic structure of the Gene Ontology and the one associated with
the gene content of each GO term.

We have first considered some basic network metrics showing that the main differences between the two networks
are in the number of links and not in the relative importance of the terms within the network.

TABLE III: Statistical characterization of the communities of the Bonferroni gene-based network. Communities have been
obtained by using the Infomap algorithm [22]. The characterization has been done by using the methodology of Ref. [21].
For each term T we have considered as attribute the membership of T to one of the communities of the adjacency semantic
network. Only the results for the five largest clusters are shown. The full list of characterizations is given in Table SM3 of the
Supplementary Material.

We have then compared the two networks at the level of network communities. In the adjacency semantic network
we detected 163 communities by using the Infomap algorithm. Such communities are homogeneous from a biological
point of view. This has been confirmed by a statistical analysis that identifies the 3-words over-expressed in
the obtained clusters. In the Bonferroni gene-based network we detected 74 small communities by using the Infomap
algorithm. Our results show that there exist GO terms that have no semantic link with each other and can
nevertheless be connected when the gene content of their children terms is considered.

We believe that the importance of our results is twofold. On one side we have devised a simple methodology,
based on the detection of the statistically significant 3-words, that gives a first-glance insight into the
biological meaning of groups of GO terms. This might become a routine tool for the biological
interpretation of groups of GO terms. On the other side we have shown that a deeper analysis of GO terms, based
on their gene content, might reveal relationships between terms that are missed by looking at the semantic structure
of GO. This has a practical importance for the evolution and maintenance of GO, since this kind of analysis can be
useful to capture the relations amongst genes to be profitably transferred to the level of terms. It is also
important from a biomedical point of view, as it might reveal how genes over-expressed in a certain term also affect
other biological processes, molecular functions and cellular components not directly linked by the GO semantics.